Automatic Corpus Extension for Data-driven Natural Language Generation
Abstract
As data-driven approaches made their way into the Natural Language Generation (NLG) domain, the need to automate corpus building and extension became apparent. Corpus creation and extension in the data-driven NLG domain has traditionally involved manual paraphrasing, performed either by a group of experts or through crowd-sourcing. Building training corpora manually is costly and requires a lot of time and human resources. We propose to automate the process of corpus extension by integrating automatically obtained synonyms and paraphrases. Our methodology allowed us to significantly increase the size of the training corpus and its level of variability (the number of distinct tokens and specific syntactic structures). Our extension solutions are fully automatic and require only some initial validation. The human evaluation results confirm that in many cases native users favor the outputs of the model built on the extended corpus.
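The idea of extending a corpus with automatically obtained synonyms can be illustrated with a minimal sketch. It is not the authors' implementation: the `SYNONYMS` table, the `extend_corpus` function, and the example sentence are hypothetical, and a plain dictionary stands in for whatever automatic synonym source is used.

```python
from itertools import product

# Hypothetical synonym table (an assumption for illustration). In practice
# the entries would come from an automatic source and be validated first.
SYNONYMS = {
    "cheap": ["inexpensive"],
    "restaurant": ["eatery"],
}


def extend_corpus(sentences, synonyms):
    """Return each sentence plus every variant obtained by synonym substitution."""
    extended = []
    seen = set()
    for sentence in sentences:
        tokens = sentence.split()
        # Each position offers the original token followed by its synonyms.
        options = [[tok] + synonyms.get(tok, []) for tok in tokens]
        for combo in product(*options):
            variant = " ".join(combo)
            if variant not in seen:  # skip duplicates across sentences
                seen.add(variant)
                extended.append(variant)
    return extended


corpus = ["a cheap restaurant near the river"]
extended = extend_corpus(corpus, SYNONYMS)
for line in extended:
    print(line)
```

In this sketch one sentence becomes four, which raises both corpus size and the number of distinct tokens. Naive substitution also yields "a inexpensive restaurant near the river", the kind of output a validation step would need to catch.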
Similar resources
Automatic Generation of a Multi Agent System for Crisis Management by a Model Driven Approach
Considering the increasing occurrences of unexpected events and the need for pre-crisis planning in order to reduce risks and losses, modeling instant response environments is needed more than ever. Modeling may lead to more careful planning for crisis-response operations, such as team formation, task assignment, and doing the task by teams. A common challenge in this way is that the model shou...
Evaluating a dialog language generation system: comparing the mountain system to other NLG approaches
This paper describes the MOUNTAIN language generation system, a fully-automatic, data-driven approach to natural language generation aimed at spoken dialog applications. MOUNTAIN uses statistical machine translation techniques and natural corpora to generate human-like language from a structured internal language, such as a representation of the dialog state. We briefly describe the training pr...
Concordance-Based Data-Driven Learning Activities and Learning English Phrasal Verbs in EFL Classrooms
In spite of the highly beneficial applications of corpus linguistics in language pedagogy, it has not found its way into mainstream EFL. The major reasons seem to be the teachers’ lack of training and the unavailability of resources, especially computers in language classes. Phrasal verbs have been shown to be a problematic area of learning English as a foreign language due to their semantic op...
Neural Sentence Ordering
Sentence ordering is a general and critical task for natural language generation applications. Previous works have focused on improving its performance in an external, downstream task, such as multi-document summarization. Given its importance, we propose to study it as an isolated task. We collect a large corpus of academic texts, and derive a data driven approach to learn pairwise ordering of...
Automatic Tweet Generation From Traffic Incident Data
We examine the use of traffic information with other knowledge sources to automatically generate natural language tweets similar to those created by humans. We consider how different forms of information can be combined to provide tweets customized to a particular location and/or specific user. Our approach is based on data-driven natural language generation (NLG) techniques using corpora conta...
Publication date: 2016